home *** CD-ROM | disk | FTP | other *** search
- "Design.doc" for WB-tree File Based Associative String Data Base.
- Copyright (C) 1991, 1992, 1993 Holland Mark Martin
-
- This is an attempt to make explicit the major decisions
- we've either made explicitly or arrived at though experience
- in the course of working on the wb-tree database.
-
- Since there are actually a lot of them,
- I've divided this into the following areas:
-
- I. B-Tree Structure and Access
- II. Concurrency
- III. Buffer, I/O, and Free-List Management
- IV. Error Handling
- V. Misc
-
- last mod:
- oct 28/1992 rjz reorganized and expanded
- nov 19/1992 rjz recorded recent ananlysis of
- UPDATE screw and deferred updates
- dec 2-3/1992 rjz random elaboratins and edits
- dec 8-9/1992 rjz RANGE doc and extensions
- dec 16 rjz misc editing
-
- jan 8/1993 rjz doc new deferred-tree-update algorithms
- jan 13-15 rjz more of same; add descr of concurrency
- problems w/free list, error codes, misc,
- deferred data update proposal.
-
- ---------------------------------------------------------------
-
- I. B-Tree Structure and Access
-
- We used SAGIV 86 as our starting point for concurrent b-tree
- algorithms. However, we made significant changes and (we feel)
- improvements. We have particularly tried to simplify the algorithms to
- reduce the implementation and especiaslly debugging effort. There are
- also a lot of complex details involved in building a complete
- implementation. The goal of this document is to describe and explain
- the major design decisions and any subtleties that arose.
-
- 1. Tree format
-
- 1.1 The WB-tree (the "wanna-be a tree")
-
- We use the B-link tree format (see SAGIV 86). Each leaf block
- contains some number of (key,value) pairs; each non-leaf (ie, index)
- block contains (key, down-pointer) pairs. Each block contains one
- additional key value, the SPLIT key, that is strictly greater than any
- other key in the block. The tree is augmented by chaining together
- the nodes of each level of the tree in split-key order (the NEXT pointers).
-
- We allow the tree to be missing any index-insertion or aock-deletion
- updates so long as certain key-order invariants are maintained:
-
- A. Blocks are chained in order of increasing split key at each level,
- and all valid blocks appear in the chain.
-
- B. Split keys are unique at each level.
-
- C. Every DOWN pointer from level N+1 points to a block B at the next lower
- level N whose split-key is less than or equal to the key associated with
- the pointer.
-
- C1. If it is less, then it must be the case that (a) a block B' with
- that split key exists at level N; (b) it is reachable from B by
- following NEXT pointers; and (c) no pointers to either B' nor any
- blocks between B and B' exist in level N+1. We call the subchain from
- B through B' the SPAN of the (key,ptr) entry (split(B'),B).
-
- C2. It is invalid for a down pointer to contain a key not present as a
- split key at the next level.
-
- Note that a key in an index must match its block's split key exactly;
- if a key K is less than the split-key S of the block B
- it points to, searches for intervening keys will be misdirected (to
- the next block); and if K is greater than S, then splits of the block
- after B at key values K' where S<K'<K will be mis-inserted in B's
- paren, because K' should logically go AFTER K, but K'<K.
-
- The notion of the span of an index entry is useful. We note that each
- block split can be thought of as an EXTENSION of some span at the
- next-higher level, while each parent-insert-update can be thought of
- as a corresponding span REDUCTION. A span that has only one block in
- it will be called FULLY REDUCED; a b-tree is fully reduced when all
- its spans are fully reduced, meaning that all pending/deferreed
- insert-updates have been performed. Lastly, we can express rule C1
- more succinctly in terms of spans:
-
- C,C2. SPANs must be well-formed (span must be closed; keys must
- match exactly)
- C1. The SPANS of the entries at any index must not overlap.
-
-
- 1.2 Key/value order; uniform leaf/node format
-
- We originally had index nodes in (value,key) order because it made
- insertions simple, but abandoned that because it made all the
- insertion/deletion propagation code special-case. In fact, because of
- the asymmetry of INSERT and DELETE, either one or the other will
- require that a single operation sometimes modify two blocks (see
- INSERTION METHOD). However, using (key,value) ordering at all levels
- of the tree simplified the code considerably.
-
- 2. Split keys
-
- Split keys seem to be an accepted method for B-tree organization.
- In a Blink-tree, wherer one might potentailly have to chain forward
- to find a desired key, the split key allows one to determine when
- the search can be terminated. Having the split key at the END of
- the block saves one block access per search. (If the split key were at
- the front of the block, one would have to access the next block
- in the chain to determine that a search could terminate.)
-
- 2.1 Last-key values
-
- It is useful to define a "largest key" to be the split key of the last
- at any given level; the code is made more complex if the last key is
- some special value (eg, the null string) that doesnt follow
- lexicographic order, not to mention the problems that arise with
- inserting into empty (last) blocks. We've reserved the set of strings
- starting with #\255 as split keys only; they cannot be used as real
- key values. Last-key values are of the form xFF<level>. This makes
- the last key in level N+1 strictly larger than any in level N, so the
- end-of-chain key for level N need not be treated specially when it is
- inserted as a key at level N+1.
-
- 2.2 Fixed split keys/Independence of levels
-
- We make the split keys at index levels act like those at the leaf
- level, that is, the split key in a block NEVER CHANGES. The advantage
- is that inserts and deletes cannot cause split-key changes, avoiding
- the need for code to propagate these changes, which are a pain to
- test. It also makes the levels genuinely independent, a useful
- conceptual simplification.
-
- The disadvantage is that this introduces a "dead zone" between the
- last key-value pair of an index block and its split key. This
- complicates the code slightly, and makes the average search path
- slightly longer than logN (by a constant factor of 1+(1/BF), where BF
- is the branching factor of the tree). More annoyingly, it interacts
- with insertions for block splits (see INSERTION).
-
- 3. Insertion method
-
- Insertion in a index needs to appear atomic so that the key-order
- invariant is maintained. Since we are using key/value-ordered index
- blocks, parent-update insertion has a special extra step: we insert
- the new key and pointer in index block B, then swap value fields with
- the next entry in sequence. Unfortunately, when the insertion is the
- end of the block, the next entry is in the next block B', so two
- blocks must be modified. (The advantage of value/key ordering is that
- insertion never modifies more than one block; however, you get screwed
- in the code for deletions instead.) The code checks for this case and
- locks both blocks before proceeding. (An alternate solution considered
- is to back up the split key of B so that the new key would go at the
- front of B', but that introduces other problems.)
-
- While the above method preserves correctness (even under concurrency),
- there is unfortunately a 2nd-order screw WHICH STILL NEEDS TO BE
- IMPLEMENTED: how to write the 2 blocks in an order that preserves
- correctness in case of disk crash. If only B is written, the database
- will contain 2 pointers to the block whose split caused the insertion;
- if only B' is written, the pointer correctness will be violated. (It
- would seem that this kind of problem is inherent to any operation that
- needs to maintain consistent changes across multiple blocks.) The fix
- seems to be to surround the write with a notation in the root that
- says this special case is being exercised and indicates what blocks
- were being modified (B,B') and what the pointers should be:
-
- Write (B,B',down-ptrs) to Blk 0
- Write B'
- (Write new block if B was split by insert)
- Write B
- Remove flags from Blk 0 and write.
-
- We elect to write out block B' first for consistency with the case
- where B' is a new block created by a block split. The screw case will
- create a damaged DB, but the danger of deleting a doubly-pointed-at
- block is avoided. This is adequate information to recover directly.
- (JONATHAN?)
-
- (Other alternatives were considered. Backing up the split key at next
- level allows a subsequent insertion to be atomic, but the backing-up
- operation has the same two-block consistency problem in keeping ITS
- parent index consistent. If the parent key isnt backed up, subsequent
- splits of the following block will result in incorrect updates at the
- next level.)
-
- 4. Deletion
-
- Deleting can be divided into two parts: the block to be deleted must
- be unlinked from both parent and predecessor, after which the block
- can be reclaimed.
-
- 4.1 Unlinking the block
-
- At the moment, we've decided to simplify the delete operation by
- requiring that both the parent and the previous block be available for
- modification, else the deletion fails (is deferred). This makes block
- deletion an atomic change, which avoids several problems introduced by
- permitting partial deletions (see CONCURRENCY).
-
- Alternatives: WANG 90 gives a clever way to do deletion w/o PREV, by
- swapping blocks. This seemed too hard to make bulletproof for
- concurrency so we stuck with the PREV method, which after all only
- does extra block accesses 1/BF of the time.
-
- Crashes during DELETE: we still be to determine an order for the
- writing of the parent and predecessor blocks in a delete.
- We elect to write out the PARENT block first, so that if the system should
- crash the chain will still be intact. The block that we were trying to
- delete will still be in the chain, but it cannot be used, it is "dead."
- [Section II.3 discusses how to dead with "dead" blocks.]
-
- 4.2 Reclaiming the block
-
- One major change from SAGIV is that SAGIV assumed multiple copies of a
- block could exist, which makes reclaiming deleted blocks complex (he
- suggests a timeout-based protocol). We opted instead to track how
- many pointers to each block are extant by adding a lock for that
- purpose, the NAME lock (essentially, a DELETE lock). A routine must
- get NAME lock on a pointer BEFORE releasing READ/WRITE on the block
- from which the pointer is obtained. (SAGIV's method is almost
- certainly more efficient in the sense that our method incurs overhead
- on every operation as opposed to just on the problem case of a READ
- request during a WRITE; several empirical studies of such tradeoffs
- support this conclusion.) On the other hand, NAME lock is useful for
- other things, such as insuring that the block you are PREVing from
- cant be deleted while you're looking for its parent or predecessor...
-
- 4.3 Non-symmetry of INSERT and DELETE
-
- It is worth remembering that INSERT and DELETE are not symmetric, in
- the sense that a postponed insertion is NOT equivalnt to deleting a
- (KEY,PTR) pair. The latter operation leaves a block whose pointer is
- missing unaccessible via the index, while the former leaves the block
- accessible through the NEXT pointer chain.
-
- This asymmetry has been the death of more proposals for fixing various
- problems involving concurrency than I care to recall!
-
- EXAMPLE:
-
- 4.3.1. Start with blk p with split key k at level n;
- the index (levl n+1) contains: ...k,p,..
-
- 4.3.2. split p at k', adding block p';
- the index shoudl now contain ...k',p,k,p'..
-
- 4.3.3. Now neither deleting p nor p' yields the original index contents:
- If we delete p, the result is ...k,p',...
- If we delete p', the result is ...k',p,...
-
- Therefore, it is not possible for a subsequent delete to "cancel"
- an insert; the deletes must either wait for the relevant inserts
- to complete or else do special work to maintain correctness.
-
- 5. Non-delete of last block in the chain
-
- We currently do not support deletion of the last block at any level,
- that is, the one with the end-of-block key. This is because this
- deletion requires special case to retain the end-of-block key. One way
- to achieve this is to copy forward the contents of the next-to-last
- block, and deleting that instead. p[There are details to be worked
- out, eg, preservation of correctness during this operation.]
-
- 6. Prev
-
- [DESCRIBE HOW IT WORKS]
-
- SCREW CASE: FIX UNDERSTOOD BUT NOT YET IMPLEMENTED: if down-pointers
- are missing to blocks immediately after the first block of a level,
- PREV will miss those blocks. The problem occurs when a PREV-BLOCK of
- block B at level N occurs, causing a PREV at level N+1. If
- down-pointers are missing, the block B' associated with B's split key
- may be some predecessor of B rather than B itself. In this case
- PREVing at level N+1 is wasteful but correct; it would require fewer
- block accesses to chain from B' than from PREV of that entry. However,
- if the B' entry is the FIRST at level N+1, PREV will erroneously
- conclude that it is at the start of the chain. It either has to (a)
- always look at B's entry at level N+1 to check that the current
- block's ptr is really up-to-date, or just check when it hits the
- START-OF-CHAIN.
-
- 7. Root Block protocol
-
- 7.1 Root uniqueness
-
- We guarantee that the root block number never changes and thus can be
- used as a unique identifier for a given WB-tree. Other systems provide
- a unique tree ID by introducing a level of indirection in root
- references; this is inefficient, as root references are frequent. When
- the root is split, we allocate a new block to hold the data that would
- have remained in the root block, then use the old root block for the
- new root. This does mean that one cannot depend on a block's ID being
- unchanged if it splits!
-
- 7.2 NOT IMPLEMENTED: ROOT DELETE and Reducing number of levels in a
- tree.
-
- 8. Other tree organizations
-
- [There are other possibilities for tree layouts. It is possible that
- some of these may simplify operations that in the current layout are
- complex, such as the insert-screw-case. Other possible choices
- include:
-
- - split key at start rather than the end of the block;
- - reversing the order of (key,value) pairs in the blocks;
- - not using the split key
- - using TWO split keys (one at each end of the block);
- - runnning the chains backearfd rather than forward;
-
- My (RJZ's) favorite alternate assumption is allowing multiple versions
- of blocks to exist, as in SAGIV 86. This would mean:
-
- - processes would be allowed to maintain STATE in a block, such as
- current position, such as for successive NEXTSs; on the other
- hand, NEXT would have to check for NEXT(X)<X and skip forward
- accordingly.
- - W/R interlock overhead should be greatly reduced (as WRITEs do not
- have to wait for reads);
- - the need for NAME locking is greatly reduced, since we no longer
- would be trying to count the of processes that "know" about each
- block;
- - reclaiming deleted blocks is harder;
- - other issues to be evaluated: effect on rebalanace, PREV,
- and the deferred-operation rpotocol we've developed.
-
- Perhaps we can explore the interrelationships in some
- methodical way someday...]
-
-
- II. Concurrency
-
- 1. Name access (and deleted-block reclamation)
-
- In order to be able to explicitly know when a block is safe to delete,
- we insist that a user must get NAME lock on a pointer BEFORE releasing
- READ/WRITE on the block from which the pointer is obtained. NAME lock
- is useful for other things, such as
-
- - insuring that the block you are PREVing from cant be deleted while
- you're looking elsewhere;
-
- - CHAIN-PUT uses it to insure that the block just split doesnt disappear
- during the call to PARENT-INSERT-UPDATE
-
- - others?
-
-
- 2. Fail-out protocol/access conflict strategy
-
- Blocked operations are at the moment simply going to fail with an
- error code of RETRYERR, meaning that they can be safely retried later.
- THe current idea is to use this whenever a READ-WRITE conflict occurs.
- (This would not be necessary using SAGIV's method.) However, since
- various other lockout and wait conditions can occur -- waits for block
- reads, waits on NAME locks, waits on interlocked operations -- some
- such facility would be needed anyway, so it seems reasonable to try to
- use it to handle READ-WRITE blocking as well.
-
- NOT REALLY IMPLEMENTED YET due to the complexity of reorganizing the
- code to pass up the appropriate information in all cases.
- (Top-level routines return these codes but internal routines
- dont really use it yet, so retryability isnt really there yet.)
-
- 3. Deferred Index Updates and concurrency
-
- The basic strategy is to allow a key insert/delete that causes a block
- to split/become empty to complete, but to then queue up the
- parent-update/block delete on some process that will retry deferred
- updates in the background until they succeed.
-
- The problem is that deletes contain two separate operations that can
- wait: the parent update and the predecessor update. The two updates
- can be done separately if the block is first marked "dead" so that no
- insertions into it can occur. Unfortunately, there is a class of rare
- and complex screw cases involving the correct ordering of deletes and
- inserts that happen to involve the same key. One might for example
- delete a block, deleting its key, and then subsequent splits can
- reintroduce and re-delete it. If operations at the index level are
- deferred, the ordering of these deferred operations determines whether
- the resulting tree is correct.
-
- Making block-delete atomic (as defined above) greatly simplifies this
- process. The block being inserted/deleted can then serve as its own
- semaphore.
-
- We have decided to adopt a lazy update strategy. That is, rather than
- keeping around queues of pending parent-updates, we just throw them
- away if they can't be executed immmediately. We can get away with this
- because the updates for level N+1 can be reconstructed from the chain
- at level N.
-
- Now, a deferred update only affects performance if the affected path
- is actually encountered: if we find we have to chain across two
- blocks, it means a parent update hasnt occurred yet; if we encounter
- an empty block, it means a delete update hasn't occurred. Our idea is
- to fix these "inefficiencies" on demand, that is, only when we run
- into one will we expend the efort to fix it.
- For the moment, assume ALL block deletes and insert updates are
- deferred.
-
- The basic algorithm is:
-
- (a) If we have to chain forward in searching for a key, there must be
- pointer missing at the next level (deferred insert). So we attempt to
- insert it. Forward chaining to reach the NEXT and PREV of a key
- is normal and hence shouldn't cause update attempts.
-
- (b) Whenever we reach an empty block, we attempt to delete it, except
- for the case where we are doing an insert which should go in that
- block. The delete attempt will fail UNLESS the relevant parent updates
- are complete, that is, if and only if the block is within a fully
- reduced span.
-
- One major concern has been that an I/O failure during a block delete
- can leave a "dead" block, that is, one which can't be reached. The
- problem is that once a block becomes dead we need to prevent it from
- being inserted into or restored into use, because that could result in
- entries being out-of-order (suppose that, while the block's dead, some
- inserts of keys less than its split key occur. They'll be directed
- into the next block by the index, and be posted there. If we then
- restore the block, the key ordering will be incorrect.) But it turns
- out that we can prevent this by observing which B-tree operations can
- encounter which types of deferred-update situation, as follows:
-
- If we think of the tree as a set of nested SPANS, where the SPAN of an
- index entry is the set of entries in the blocks it SPANS, we note that
- during operations using search only - GET,PUT,REM -- the locus of
- search stays strictly within the SPAN of some entry in the root. We
- enforce that a block not be deleted until its parent pointers are
- completely updates, that is, its pointer has a span of exaclty one
- block. Now suppose a block delete fails halfway, leaving a empty
- block without any parent pointer. Such a block is unreachable by
- FIND-NODE and hence by INSERT! This means that if INSERT encounters
- an empty block, it must be valid to insert into it.
-
- This has several interesting consequences:
-
- (a) This means that only the operations that can possibly chain ACROSS
- spans have to worry about dead blocks: the NEXT and PREV operations.
- (What exactly is the effect on PREV?)
-
- (b) This means that "dead" blocks can't be deleted (unless we look for
- them specially). But they (a) are only created by a rare kind of
- event, and (b) won't hurt anything, so long as we arrange that NEXT
- and PREVs ignore the empty blocks (ie, ignore the value of their split
- keys).
-
- (c) The way we have defined deferred-delete reconstruction, if the
- block isn't within a fully-reduced span, the delete simply can't
- succeed. This means that there's no point in trying to delete a blank
- block when we're chaining though a span, that is, we should only try
- to delete empty blocks encountered (1) when following a DOWN pointer
- in FIND-ENT, or (2) due to a NEXT operation (or PREV?).
-
- [Having realized this, we can actually let block-deletes be two-part
- operations (again). The only additional complexity is that then we'd
- need to implement a method for detecting dead blocks, that is,
- differentiating them from empty blocks in the middle of large spans,
- which we mustn't try to delete. We just check if the block is
- contined in some span! HOW TO DO THAT EFFICIENTLY?]
-
- In detail, the method of detecting deferred operations goes like this:
-
- 1. RECONSTRUCTING DEFERRED PARENT-UPDATES (missing DOWN pointers):
-
- Whenever we have to follow the NEXT pointer of a block B (in FIND-NODE), we
- should attempt a PARENT-INSERT-UPDATE using the (key,value) pair
- (split(B),next(B)). FIND-NODE needs to be fixed to chain right rather
- than down when the "dead zone" is encountered; and it should not attempt a
- PARENT-INSERT-UPDATE in this case.
-
- PARENT-INSERT-UPDATE can fail if:
- (a) some other process is already doing the update (the first one to
- lock the parent wins, the other will fail out);
- (b) the key split(B) is already present (means a DELETE is pending --
- this shouldn't really occur, though);
- (c) the update requires a block split but no blocks are available.
-
- 2. RECONSTRUCTING DEFERRED BLOCK DELETES:
-
- Whenever we reach an empty block in FIND-NODE or a NEXT/PREV
- operation, we'll attempt a PARENT-DELETE-UPDATE.
-
- PARENT-DELETE-UPDATE should fail when
- (a) the key is found but points to a different block (meaning
- the containing span isn't fully reduced);
- (b) the block is NAME-locked (this means its safe to leave name-locks
- around, the worst that can happen is they cause block deletes to
- be deferred some);
- (c) someone else is doing a PARENT-DELETE-UPDATE on this block
- (this can be guaranteed by having the delete process first
- name-lock the block being deleted;
- (d) some block needing to be modified (parent or prev) is locked.
-
- 3. We need to fix the code that does the NEXT-KEY operation
- to not stop at potentially-dead blocks. CHAIN-FIND need NOT
- ignore blank blocks.
-
- In practice, we'll want to trigger the update routines at the normal
- times as well, ie, try insert-updates after block splits and
- delete-updates after a block becomes empty.
-
- There are a number of new statistics we should keep; these include:
-
- * deferred parent-insert-updates (PUDs)
- * insert-updates that succeed
- * insert-updates that fail (measures overhead of the method)
- * insert-update failure rate
-
- * deferred block deletes
- * deferred block deletes that succeed
- * deferred block deletes that fail (measures overhead of the method)
- * delete failure rate
-
- * count of dead blocks found (requires extra work)
-
- (The number of chain-forwards is also a meaure of the overhead.)
-
- [Note: this mechanism aldo supports a simple queuing method: keep a
- list of the block numbers at which updates were deferred, and in times
- of low usage do FINDs to them, which will update them as a
- side-effect...]
-
- ------------
-
- [Other strategies were considered but were either significantly more
- complex or overheady, or introduced unnecessary delays:
-
- One hack is to serialize the queue of potponed index INSERT and DELETE
- operations by brute force: to do them in exactly the order they
- arrived in. Possibly simpler alternative method: sort by key and
- timestamp them!
-
- I think we also decided that the interpreter could simply devote a
- process to each deferred operation, since we want to shift resources
- toward accomplishing the deferred ops if too many back up.
-
- WANG 90 maintains a queue of deleted operation on the DESTINATION
- block. This has the disadvantage that whenever a block is split or
- merged the deferred-operation queues have to be split or merged as
- well -- uggh!.]
-
-
- III. Buffer, I/O, and Free-List Management
-
- 1. When to reclaim buffers - age vs. level vs. dirtyness
-
- THIS IS A PERENNIAL NUISANCE -- THERE's a complex system in place which
- hasnt really been evaluated or tuned, and theres also a proposal by
- Jonathan to simply use the first free (or freeable) buffer one finds.
-
- 2. UPDATE-ACCESS:
-
- It seems to me there was some case where it wasnt
- OK to just do a release to #f and an update from there -- but i cant
- think of it offhand...
-
- 3. Deferred writes of data blocks
-
- Note: this is a different sort of deferral from the "deferred index
- updates" discussed earlier. Those deferred operations were correctness-
- presetrving; these are not. The idea here is that we can reduce i/o
- traffic if we can safely lose a "small" number of updates.
-
- 3.1 Description and purpose
-
- The general idea here is that we can reduce I/O traffic by
- deferring writes of leaf blocks, in the sense that the updates can be
- lost without compromising the structure of the database. This only
- works where the data updates in question are not critical to database
- or application. Currently, we always defer leaf updates -- both PUTs
- and REMoves -- to data trees, where a data tree is any tree not of
- type DIRECTORY. (DIRECTORY leaf blocks are written to disk after every
- update.) The idea is that the user application should have control over
- how often the data blocks are written.
-
- Also, deferral of PUTs and DELETES should be separately controllable.
- This feature is needed for example by the database itself in
- maintaining the free list: we can afford to defer INSERTS of deleted
- block, because the worst that could happen is that a free block gets
- lost. But DELETES from the free list must update immediately, else a
- block could be doubly-allocated.
-
- 3.2 Implementation
-
- The handle is being extended to contain the following boolean values:
-
- (a) SAP: save block after PUTs
- (b) SAR: save block after REMOVEs
- (c) SAC: force block save after cached block changes
- (d) FAC: flush buffer entirely after cached block changes
- (not currently implemented -- future functionality)
-
- These bits are used for example as follows:
-
- DIRECTORY: SAP=SAR=1;
- FREE LIST: SAR=1; SAP=SAC=0;
- USER DATA: SAP=SAR=SAC=0;
-
- The state of these bits can be changed at any time using (HAN-SET-WCB
- han new-bits). The user can insure that the current block has been
- written out by calling WRITE-CACHE-BLK, and can insure that all
- modified blocks have been written out by calling WRITE-MODIFIED-BLKS.
- At the moment, for safety, directory trees force HAN-SAP and HAN-SAR
- when opened or created. Also, OPEN-SEG forces the free list write control
- bits to be as shown above, regardless of the block type of the free-list.
-
- 4. Caching of last (leaf) block used
-
- Works great
-
- NOT IMPL YET: We could actually be cleverer and cache the parents of
- every node, and even the PREVs, using the same TRY-GET-ENT protocol!
-
- 5. Note.
-
- Multiple READ access is not supported yet.
- READ-READ conflicts can occur. We seem to recall noting that
- this limitation kept us from encountering some other problem,
- whose identity is lost for the moment.
-
- 6. Free-block management
-
- Outline:
-
- - free list is implemented as a B tree (its root block is in
- block 2 of the file)
- - with cache (free-list-cache, one per segment)
- - cache gets filled (to 1/2 full) when it empty and a block is needed;
- it gets flushed (to 1/2 full) when its about to overflow.
- - cache can be filled either from the free-list-tree or if that's
- empty, the file is extended.
- - cache filling is now done efficiently using scan-delete. This
- also means that only one disk write will be done per
- fill (in most cases).
- - cache filling is interlocked so only one process can be filling the
- cache at a time.
- - currently cache emptying is inefficient as each write of of a block
- to the tree causes an I/O [deferred writes is intended to fix this]
-
- Current problems:
-
- Last week (1/8) we concluded that (all efforts to date
- notwithstanding) there were still failure cases in free-block
- management under concurrent operation, because of problems like:
-
- - multiple processes can eat up however many blocks are in the cache,
- meaning we still haven't handled the case where a PUT to the free list
- causes a block split (necessitating a recursive call to FREE-BLOCK).
-
- - Conversely, it is possible for the cache to become overfilled
- by a cache filling operation, if in the meantime OTHER processes fill
- the cache with deleted blocks.
-
- - We need to be sure in general that concurrent fills and/or empties
- don't overfill or overempty the cache.
-
- On the positive side, we realized that we NEED NOT allow enough
- free blocks for a free-list insert to split the whole tree -- any
- split except the leaf split can simply be postponed if there isnt a
- free block available at that time! (And similarly for any insert-updates
- that happen to be triggered by free-list accesses.)
-
- Come to think of it, if we happen to run out of blocks during a
- free-list insert, its OK to let the insert fail, we just lose one disk
- block!! That may just be the answer!!
-
- 7. Buffer management routines
-
- Two routines have been built for purging buffers.
-
- FLUSH-BUFFER(ENT) writes out ENT if it is dirty and unlocked.
- It returns TERMINATED if ENT is locked, RETRYERR if the write is
- attempted and fails, and SUCCESS otherwise.
-
- PURGE-BUFFER(ENT) writes out ENT if it is dirty and then
- frees up the buffer. This IGNORES the acceess status of the buffer,
- so it sould not be called by users; it always sreturns SUCCESS.
-
- Use (DO-SEG-BUFFERS SEG# FUNC) to apply a function to all the buffers
- of a given segment; for example, (DO-SEG-BUFFERS SEG# FLUSH-BUFFER)
- can be used to guarantee that segment SEG#'s disk file is up to date.
- DO-SEG-BUFFERS halts if FUNC returns other than SUCCESS; the result of
- FUNC is returned. SUCCESS is returned if all buffers have been
- successfully processed. To process all segments, use SEG #=-1.
-
- (CHECK-BUFFER ENT) chekcs that the buffer is written and unlocked,
- and repairs those that are not.
-
-
- 8. Other issues to document:
-
- - bucket-locking method
- - treatment of access conflict conditions
- - amnesia-ent
- - caching criteria (eh?)
-
- IV. Error handling
-
- The problem: Calls need to return adequate information to distinguish:
- 1. restartable operations, for example
- - update-parent on insert
- - update-parent on delete
- - unlink for delete
- 2. non-restartable failures (eg, disk i/o error)
- 3. null results (eg, key not found)
- 4. "real values", eg, the length of a returned string, or
- an ENT pointer vs. NULL.
-
- The solution: use a bounded range of negative values as "failure"
- codes, leaving the non-negative return values available as "success"
- codes. The canonical success code is 0, but a routine that needs to
- return a value (string length, ENT pointer) can do so and have that
- value interpreted as a "success" code as well.
-
- There are "degrees" of failure,and the negative codes are partitioned
- according to increasingly severity, as follows:
-
- SUCCESS 0 ; successful execution
- NOTPRES -1 ; successful execution, no data present or no change made
- TERMINATED -2 ; failure, no damage, caller can retry operation
-
- RETRYERR -10 ; failure, no damage, caller can retry operation
- ARGERR -15 ; failure, no damage, call was in error
-
- NOROOM -20 ; failure, no damage, out of room in file
- TYPERR -30 ; failure, file or object was not of correct type
-
- IOERR -40 ; i/o error, DB may be damaged
- STRANGERR -45 ; internal error, DB may be damaged
- UNKERR -90 ; placeholder code
- MAXERR -100
-
- The first class represent operations that completed without error.
- The second class represent operations that failed to complete,
- but are guarantedd to leve the DB in a correct state and are
- retryable (or easily correctible).
- The third class represent operations that failed to complete,
- did not damage the database, but are not easily fixable or restartable.
- The last class represent error conditions in which the DB was
- corrupted, or diuring which DB corruption was detected.
-
- The predicate (ERR? code) returns #t if the return code is
- within the range NOTPRES-MAXERR; the predicate (REALERR? code)
- returns #t if CODE is an actual error, as opposed to a "not there"
- or "stop processing" message.
-
- V. Misc
-
- 1. "Largest keys" (End-of-chain marker strings)
-
- Because we've reserved keys with a first byte of FF for split keys,
- these keys are unavailable to the user.
-
- 2. NULL keys
-
- WB supports use of the null string as a key (Sliced Bread treats
- the key specially).
-
- 3. Searching on minimum and maximum keys
-
- Given that the null string may be used as a key, there needs to be a
- way to specify a "least" key such that NEXT(least) yields the first
- key, even if it is NULL. The special key strings with length s of -2
- and -1 are provided to represent the minimum and maximum key values,
- respectively. These keys are only useful with NEXT and PREV; data
- cannot be associated with these keys.
-
- 4. Need to document TSCAN, TSTATS.
-